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Abstract 

Background: Motif searching is an innportant step in tlie detection of rare events occurring in a set of DNA or protein 
sequences. One formulation of tine problem is known as (/,6/)-motif search or Planted Motif Search (PMS). In RMS we 
are given two integers / and d and n biological sequences. We want to find all sequences of length / that appear in 
each of the input sequences with at most d mismatches. The PMS problem is NP-complete. PMS algorithms are 
typically evaluated on certain instances considered challenging. Despite ample research in the area, a considerable 
performance gap exists because many state of the art algorithms have large runtimes even for moderately 
challenging instances. 

Results: This paper presents a fast exact parallel PMS algorithm called PMS8. PMS8 is the first algorithm to solve the 
challenging (l,d) instances (25, 10) and (26, 1 1). PMS8 is also efficient on instances with larger / and d such as (50,21). 
We include a comparison of PMS8 with several state of the art algorithms on multiple problem instances. This paper 
also presents necessary and sufficient conditions for 3 /-mers to have a common d-neighbor. The program is freely 
available at http://engr.uconn.edu/~man09004/PMS8/. 

Conclusions: We present PMS8, an efficient exact algorithm for Planted Motif Search. PMS8 introduces novel ideas 
for generating common neighborhoods. We have also implemented a parallel version for this algorithm. PMS8 can 
solve instances not solved by any previous algorithms. 



Background 

This paper presents an efficient exact parallel algorithm 
for the Planted Motif Search (PMS) problem also known 
as the (/, d) motif problem [1]. A string of length / is caller 
an /-men The number of positions where two /-mers u and 
V differ is called their Hamming distance and is denoted 
by Hd{Uy v). For any string T, T[i„j] is the substring of T 
starting at position / and ending at position The PMS 
problem is the following. Given n sequences 81,82, ... >Sn 
of length m each, from an alphabet E and two integers / 
and d, identify all /-mers M,M e that occur in at least 
one location in each of the n sequences with a Hamming 
distance of at most d. More formally, M is a motif if and 
only if V/, 1 < i < n, 3;'/, 1 < < m — / + 1, such that 
Hd(M,8i[Ji..ji^m-l])<d. 

The PMS problem is essentially the same as the Closest 
Substring problem. These problems have applications 
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in PGR primer design, genetic probe design, discover- 
ing potential drug targets, antisense drug design, finding 
unbiased consensus of a protein family, creating diag- 
nostic probes and motif finding (see e.g., [2]). Therefore, 
efficient algorithms for solving the PMS problem are very 
important in biology and bioinformatics. 

A PMS algorithm that finds all the motifs for a given 
input is called an exact algorithm. All known exact algo- 
rithms have an exponential worst case runtime because 
the PMS problem is NP-complete [2] . An exact algorithm 
can be built using two approaches. One is sample driven: 
for all (m — / + 1)^ possible combinations of /-mers coming 
from different strings, generate the common neighbor- 
hood. The other is pattern-driven: for all possible 
/-mers check which are motifs. Many algorithms employ 
a combination of these two techniques. For example, [3] 
and [4] generate the common neighbors for every pair of 
/-mers coming from two of the input strings. Every neigh- 
bor is then matched against the remaining n — 2 input 
strings to confirm or reject it as a motif. Other algorithms 
([5,6]) consider groups of three /-mers instead of two. 
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PMS algorithms are typically tested on instances gen- 
erated as follows (also see [1,4]): 20 DNA strings of 
length 600 are generated according to the i.i.d. (inde- 
pendent identically distributed) model A random /-mer 
is chosen as a motif and planted at a random loca- 
tion in each input string. Every planted instance is 
modified in at most d positions. For a given integer /, 
the instance (lyd) is defined to be challenging if d is 
the smallest integer for which the expected number 
of motifs of length / that occur in the input by ran- 
dom chance is > 1. Some of the challenging instances 
are (13, 4), (15, 5), (17, 6), (19, 7), (21, 8), (23, 9), (25, 10), 
(26, 11), etc. 

The largest challenging instance solved up to now has 
been (23, 9). To the best of our knowledge the only algo- 
rithm to solve (23, 9) has been qPMS7 [5]. The algorithm 
in [7] can solve instances with relatively large / (up to 48) 
provided that d is at most 1/4, However, most of the well 
known challenging instances have d > //4. PairMotif [3] 
can solve instances with larger /, such as (27, 9) or (30, 9), 
but these are significantly less challenging than (23, 9). 
Furthermore, for instances that current algorithms have 
been able to solve, the runtimes are often quite large. 

In this paper we propose a new exact algorithm, PMS8, 
which can efficiently solve both instances with large / and 
instances with large d. The efficiency of PMS8 comes 
mainly from reducing the search space by using the prun- 
ing conditions presented later in the paper, but also from 
a careful implementation which utilizes several speedup 
techniques and emphasizes cache locality. 

One of the basic steps employed in many PMS algo- 
rithms (such as PMSprune [4], PMS5 [8], PMS6 [9], and 
qPMS7 [5]) is that of computing all the common neigh- 
bors of three /-mers. In qPMS7, this problem is solved 
using an Integer Linear Programming (ILP) formulation. 
In particular, a large number of ILP instances are solved 
as a part of a preprocessing step and a table is populated. 
This table is then repeatedly looked up to identify com- 
mon neighbors of three /-mers. This preprocessing step 
takes a considerable amount of time and the lookup table 
calls for a large amount of memory. In this paper we offer a 
novel algorithm for computing all the common neighbors 
of three /-mers. This algorithm eliminates the preprocess- 
ing step. In particular, we don't solve any ILP instance. We 
also don't employ any lookup tables and hence we reduce 
the memory usage. We feel that this algorithm will find 
independent applications. Specifically, we state and prove 
necessary and sufficient conditions for 3 /-mers to have a 
common neighbor. 

Methods 

For any /-mer u we define its (i- neighborhood as the 
set of /-mers v for which Hd(u, v) < d. For any set of 
/-mers T we define the common /i-neighborhood of T as 



the intersection of the (^-neighborhoods of all /-mers in T. 
To compute common neighborhoods, a natural approach 
is to traverse the tree of all possible /-mers and iden- 
tify the common neighbors. A pseudocode is given in 
Appendix 1. A node at depth /c, which represents a /c-mer, 
is not explored deeper if certain pruning conditions are 
met. Thus, the better the pruning conditions are, the faster 
will be the algorithm. We discuss pruning conditions in a 
later section. 

PMS8 consists of a sample driven part followed by a 
pattern driven part. In the sample driven part we gen- 
erate tuples of /-mers originating from different strings. 
In the pattern driven part we generate the common 
(^-neighborhood of such tuples. Initially we build a matrix 
R of size n X (m — I -\- 1) where row / contains all the 
/-mers in 5/. We pick an /-mer x from row lofR and push 
it on a stack. We filter out any /-mer in 7? at a distance 
greater than 2d from x. Then we pick an /-mer from the 
second row of R and push it on the stack. We filter out 
any /-mer in R that does not have a common neighbor 
with the /-mers on the stack; then we repeat the process. 
If any row becomes empty, we discard the top of the stack, 
revert to the previous instance of R and try a different 
/-mer. If the stack size is above a certain threshold (see 
section on Memory and Runtime) we generate the com- 
mon (^-neighborhood of the /-mers on the stack. For each 
neighbor M we check whether there is at least one /-mer 
u in each row of R such that Hd(M, u) < d. If this is true 
then M is a motif. The pseudocode of PMS8 is given in 
Appendix 2. 

A necessary and sufficient condition for 3 /-mers to have 
a common neighbor is discussed in the section on pruning 
conditions. For 4 or more /-mers we only have necessary 
conditions, so we may generate tuples that will not lead 
to solutions. However, due to the way the filtering works, 
the following observations apply. Let the stack at any one 
time contain t /-mers: ri, r2, . . . , where Vt is at the top 
of the stack. Evidently, the /-mers on the stack pass the 
necessary pruning conditions for t /-mers. However, the 
following are also true. For any two /-mers Vi and rj on 
the stack, there exists a common neighbor. This is true 
because whenever we filter an /-mer we make sure it has 
at least one neighbor in common with the /-mer at the 
stop of the stack. Thus, we can prove by induction that 
any two /-mers in the stack have a common neighbor. Sec- 
ond, any triplet of the form ri, r2, 2 < i < t, has at 
least one common neighbor. This is true because when 
the stack had only the first two /-mers we filtered out 
any /-mer which doesn't make a compatible triplet with 
the two. In general, for any number p < t the (p -\- I) 
tuples of the form ri, r2, . . . , r^, where p < i < t 
must pass the pruning conditions for + 1 /-mers. There- 
fore, even though the pruning for more than 3 /-mers 
is not perfect, the algorithm implicitly tests the pruning 
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conditions on many subsets of the /-mers in the stack 
and thus decreases the number of false positive tuples 
generated. 

Speedup techniques 
Sort rows by size 

An important speedup technique is to reorder the rows of 
R by size after every filtering step. This reduces the num- 
ber of tuples that we consider at lower stack sizes. These 
tuples require the most expensive filtering because as the 
stack size increases, fewer /-mers remain to be filtered. 

Compress l-mers 

We can speed up Hamming distance operations by com- 
pressing all the l-mers of R in advance. For example, for 
DNA we store 8 characters in a 16 bit integer, divided 
into 8 groups of 2 bits. For every 16 bit integer / we store 
in a table the number of non-zero groups of bits in /. To 
compute the Hamming distance between two /-mers we 
first perform an exclusive or of their compressed repre- 
sentations. Equal characters produce bits of 0, different 
characters produce non-zero bits. Therefore, one table 
lookup provides the Hamming distance for 8 characters. 
One compressed /-mer requires /* [log | E H bits of storage. 
However, we only need the first 16 bits of this representa- 
tion because the next 16 bits are the same as the first 16 
bits of the /-mer 8 positions to the right of the current one. 
Therefore, the table of compressed /-mers only requires 
0(n(m — / + 1)) words of memory. 

Preprocess distances for pairs of l-mers 

The filtering step tests many times if two /-mers have a 
distance of no more than 2d, Thus, for every pair of /-mers 
we compute this bit of information in advance. 

Cache locality 

We can update R in an efficient manner as follows. Every 
row in the updated matrix R^ is a subset of the correspond- 
ing row in the current matrix 7?, because some elements 
will be filtered out. Therefore, we can store R^ in the same 
memory locations as R. To do this, in each row, we move 
the elements belonging to R^ at the beginning of the row. 
In addition, we keep track of how many elements belong to 
R\ To revert from R^ back to 7?, we restore the row sizes to 
their previous values. The row elements will be the same 
even if they have been moved within the row. The same 
process can be repeated at every step of the recursion, 
therefore the whole "stack" of R matrices is stored in a 
single matrix. This reduces the memory requirement and 
improves cache locality. The cache locality is improved 
because at every step of the recursion, in each row, we 
access a subset of the elements we accessed in the previ- 
ous step, and those elements are in contiguous locations 
of memory. 



Find motifs for a subset of the strings 

We use the speedup technique described in [10]: compute 
the motifs for W < noi the input strings, then test each of 
them against the remaining n—w^ strings. For the results in 
this paper n' was heuristically computed using the formula 
provided in Appendix 4. 

Memory and Runtime 

Since we store all matrices R in the space of a single matrix 
they only require 0(n(m— /+1)) words of memory. To this 
we add 0{n^) words to store row sizes for the at most n 
matrices which share the same space. The bits of informa- 
tion for compatible /-mer pairs take 0((n(m — / + l))^/w) 
words, where w is the number of bits in a machine word. 
The table of compressed /-mers takes 0(n(m — / + 1)) 
words. Therefore, the total memory used by the algorithm 
is 0(n(n + m - / + 1) + (n(m - I + l)f/w). 

The more time we spend in the sample driven part, the 
less time we have to spend in the pattern driven part and 
vice-versa. Ideally we want to choose the threshold where 
we switch between the two parts such that their runtimes 
are almost equal. The optimal threshold can be deter- 
mined empirically by running the algorithm on a small 
subset of the tuples. In practice, PMS8 heuristically esti- 
mates the threshold t such that it increases with d and 
I E I to avoid generating very large neighborhoods and it 
decreases with m to avoid spending too much time on fil- 
tering. All the results reported in this paper have been 
obtained using the default threshold estimation provided 
in Appendix 4. 

Parallel implementation 

We can think of m — / + 1 independent sub problems, one 
for each /-mer in the first string. The first string in each 
sub problem is an /-mer of the original first string and the 
rest of the strings are the same as in the original input. 
Because of this, the problem seems to be "embarrassingly 
parallel." A straightforward parallelization idea is to assign 
an equal number of subproblems to each processor. This 
method has the advantage that no inter-processor com- 
munication is necessary beyond broadcasting the input 
to all processors. This method would work well for algo- 
rithms where each subproblem is expected to have a 
similar runtime. However, for PMS8, the runtime of each 
subproblem is very sensitive to the size of the neighbor- 
hoods for various combinations of /-mers and therefore 
some processors may end up starving while others are still 
busy. 

To alleviate the above shortcoming we employ the fol- 
lowing strategy. The processor with rank 0 is a scheduler 
and the others are workers. The scheduler spawns a sep- 
arate worker thread to avoid using one processor just for 
scheduling. The scheduler reads the input and broadcasts 
it to all workers. Then each worker requests a sub problem 
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Figure 1 Proof of theorem 1, case 1. Proof of theorem 1, case 1 : There exists /, 1 < / < 3 for which n,- > dj. Without loss of generality we assume 
j = ]. The top 3 rows represent the input /-mers. The last row shows a common neighbor M. In any column, identical colors represents matches, 
different colors represent mismatches. 



from the scheduler, solves it and repeats. The scheduler 
loops until all jobs have been requested and all workers 
have been notified that no more jobs are available. At the 
end, all processors send their motifs to the scheduler. The 
scheduler loops through all the processors and collects the 
results. The scheduler then outputs the results. 

Pruning conditions 

In this section we present pruning conditions applied for 
filtering /-mers in the sample driven part and for pruning 
enumeration trees in the pattern driven part. 

Two /-mers a and b have a common neighbor M such 
that Hd{a,M) < da and Hd{b,M) < dy if and only if 
Hd{a, b) < da -\- dy. For 3 /-mers, no trivial necessary 
and sufficient conditions have been known up to now. In 
[8] sufficient conditions for 3 /-mers are obtained from 
a preprocessed table. However, as / increases the mem- 
ory requirement of the table becomes a bottleneck. We 
will give simple necessary and sufficient conditions for 3 
/-mers to have a common neighbor. These conditions are 
also necessary for more than 3 /-mers. 

Let r be a set of /-mers and M be an /-mer. If 
^^^jHdiMyU) > I r|<i then, by the pigeonhole princi- 
ple, one /-mer must have a distance from M greater than 
d. Therefore, M cannot be a common neighbor of the 
/-mers in T, If we have a lower bound on ^^^j Hd{My u) 



for any M, then we can use it as a pruning condition. If the 
lower bound is greater than \ T\d then there is no common 
neighbor for T, One such lower bound is the consensus 
total distance. 

Definition 1. Let T be a set ofl-mers, where k = \T\. For 
every i, the set Ti [ i] , /],.., Tj^[ i] is called the i-th column 
of T, Let mi be the maximum frequency of any character 



in column i. Then Cd{T) = J2h 
consensus total distance ofT. 



mi is called the 



The consensus total distance is a lower bound for the 
total distance between any /-mer M and the /-mers in 
T because, regardless of M, the distance contributed by 
column / to the total distance is at least k — m/. The 
consensus total distance for a set of two /-mers A and B 
will be denoted by Cd(A,B), Also notice that Cd(A,B) = 
Hd(A, B), We can easily prove the following lemma. 

Lemma 1. Let T be a set of l-mers and k = \T\. Let 
d\,d2y... dk be non-negative integers. There exists a l-mer 
Msuch thatHd(M, Td < du'ii, only ifCd(T) < l^-^^di. 

Theorem 1. Let T be a set of 3 l-mers and di, d^, d^ be 

non-negative integers. There exists a l-mer M such that 
Hd{M, Ti) < <i/, V/, 1 < / < 3 if and only if the following 
conditions hold: 



Til 
T2| 
T3| 

M I 



nO 



nl 



n2 



n3 



n4 



^nl+n4-dl^ ^n2+n4-cl2-^ ^n3+n4-d3^ 



Figure 2 Proof of theorem 1, case 2. Proof of theorem 1, case 2: n,- < dj for all /, 1 < / < 3. The top 3 rows represent the input /-mers. The last row 
shows a common neighbor M. In any column, identical colors represents matches, different colors represent mismatches. 
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Figure 3 Speedup of the parallel implementation over the single 
core version. Speedup of the multi-core version of PMS8 over tlie 
single core version, for several datasets. 



i) Cd(Ti, Tj) < di + dj, WiJ, 1 < / < ; < 3 

ii) Cd{T) < di -\- d2 + ds 

Proof The "only if" part follows from lemma 1. 

For the "if" part we show how to construct a common 
neighbor M provided that the conditions hold. We say that 
a column k where Ti[k] = T2[k] = T^lk] is of type Nq, 
If Ti[k] ^ T2[k] = T^lk] then the column is of type A^i. 
If Ti [k] = T3 [k] ^ T2 [k] the column is of type N2 and if 
Ti[k] = T2[k] ^ T3[k] then the column is of type TVs- If all 
three characters in the column are distinct, the column is 



of type A/4. Let n/, V/, 0 < / < 4 be the number of columns 
of type Ni, Consider two cases: 

Case 1) There exists /, 1 < / < 3 for which ni > di. We 
construct M as illustrated in Figure 1. Pick di columns of 
type Hi, For each chosen column k set M[k] = Tj[k] where 
; 7^ /. For all other columns set M[k] = Ti[k], Therefore 
Cd(Ti,M) = di. For ; 7^ / we know that Cd(Ti, Tj) < 
di + dj from condition i) (condition i is assumed to be 
true at this point because we are proving the "if" part). We 
also know that Cd(Ti,M) + Cd(M, Tj) < Cd(Ti, Tj) from 
the triangle inequality. It follows that Cd{My Tj) < dj. 
Since Cd{M, Tj) = Hd(M, Tj) it means that M is indeed a 
common neighbor of the three /-mers. 

Case 2) For all 1 < / < 3 we have Hi < di. We 
construct M as shown in Figure 2. For columns k of type 
A/o, A/2 and N3 we set M[k] = Ti[k], For columns of type 
Ni we set M[k] = T2[k], For any 1 < / < 3 the follow- 
ing applies. If Hi + ^4 < di then the Hamming distance 
between M and Ti is less than di regardless of what char- 
acters we choose for M in the columns of type A/4. On 
the other hand, if Hi + ^4 > di then M and Ti have 
to match in at least Hi + ^4 — di columns of type A/4. 
Thus, we pick max(0, rii + ^4 — di) columns of type A/4 
and for each such column k we set M[k] = Ti[k], Now 
we prove that we actually have enough columns to make 
the above choices, in other words l^f^^max(0, rii + ^4 — 
di) < ^4. This is equivalent to the following conditions 
being true: 

a) For any /, 1 < / < 3 we want Hi + ^4 — di < n^. This 
is true because rii < di. 

b) For any 1 < / < y < 3 we want 

{rii + H4 — di) + {rij + ^4 — dj) < n^. This can be 



IPMS8 ■ PMS8 without sorting 



^ 4000 



.ll 



(15,5) 



(17,6) 



(19,7) 
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Figure 4 The effect of the "sort rows by size" speedup on runtime. Comparison between PMS8 and PMS8 without sorting rows by size (see the 
section on Speedup techniques). Note that, in both versions, all the other speedups are enabled. The runtimes are for single core execution, 
averaged over 5 random instances. 
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21.3m 












50 


Is 


Is 


Is 


Is 


Is 


Is 


Is 


Is 


Is 


Is 


Is 


Is 


Is 


2s 


9s 


2.5m 


4m 


4.61h 












4 


5 


6 


7 


8 


9 


10 


11 


12 


13 


14 


15 


16 


17 


18 


19 


20 


21 


22 


23 


24 


25 



I Time on single core | | Time on 48 cores | | Not solved yet | | More than 500 spurious motifs | 



Figure 5 PMSS runtimes for multiple combinations of I and d. PMS8 runtimes for datasets with / up to 50 and d up to 25 averaged over 5 
random datasets. Wliite bacl<ground signifies single core execution. Blue background signifies execution using 48 cores. Instances in gray have 
more than 500 spurious motifs. Orange cells indicate unsolved instances. Time is reported in hours (h), minutes (m) and seconds (s). 



rewritten as ni + nj -\- < di + dj. The left hand side 
is Hd(Ti, Tj) which we know is less or equal to di + dj. 
c) We want E nt + ^4 — di < ^4. This can be rewritten 
as ^1+^2 + ^3+ 2^4 < di -\- d2 -\- d'^. The left hand 
side is Cd(T) which we know is less than di-\-d2-\-d'^. 

□ 

One of our reviewers kindly pointed out that the above 
proof is similar to an algorithm in [11]. 

Results and discussion 

PMSS is implemented in C++ and uses OpenMPI for 
communication between processors. PMSS was evaluated 
on the Hornet cluster in the Booth Engineering Center 
for Advanced Technology (BECAT) at University of Con- 
necticut. The Hornet cluster consists of 64 nodes, each 
equipped with 12 Intel Xeon X5650 Westmere cores and 
4S GB of RAM. The nodes use Infiniband networking for 
MPI. In our experiments we employed at most 4S cores on 
at most 4 nodes. 

We generated random (/, <i) instances according to [1] 
and as described in the introduction. For every (/, <i) 



combination we report the average runtime over 5 ran- 
dom instances. For several challenging instances, in 
Figure 3 we present the speedup obtained by the paral- 
lel version over the single core version. For p = 48 cores 
the speedup is close to 5' = 45 and thus the efficiency is 
E = S/p = 94%. 



Table 1 Comparison between qPMS7 and PMSS 



Instance 


qPMS7 


PMS8^ 


PMS8^^ 




PMSS^s 


(13,4) 


29s 


7s 


3s 


2s 


2s 


(15,5) 


2.1m 


48s 


5s 


4s 


3s 


(17,6) 


10.3m 


5.2m 


22s 


12s 


9s 


(19,7) 


54.6m 


26.6m 


1.7m 


52s 


37s 


(21,8) 


4.87h 


1.64h 


6.5m 


3.3m 


2.2m 


(23,9) 


27.09h 


5.48h 


21.1m 


10.7m 


7.4m 


(25,10) 




15.45h 


1.01 h 


30.4m 


20.7m 


(26,11) 










46.9h 



Comparison between qPMS7 and PMSS on challenging instances. PMSS^ means 
PMSS used P CPU cores. Both programs have been executed on the same 
hardware and the same datasets. The times are average runtimes over 5 
instances for each dataset. 
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are solved efficiently using a single CPU core. For more 
challenging instances we report the time taken using 48 
cores. 

A comparison between PMS8 and qPMS7 [5] on chal- 
lenging instances is shown in Table 1. Both programs 
have been executed on the Hornet cluster. qPMS7 is a 
sequential algorithm. PMS8 was evaluated using up to 48 
cores. The speedup of PMS8 single core over qPMS7 is 
shown in Figure 6. The speedup is high for small instances 
because qPMS7 has to load an ILP table. For larger 
instances the speedup of PMS8 sharply increases. This 
is expected because qPMS7 always generates neighbor- 
hoods for tuples of 3 /-mers, which become very large as / 
and d grow. On the other hand, PMS8 increases the num- 
ber of /-mers in the tuple with the instance size. With each 
/-mer added to the tuple, the size of the neighborhood 
reduces exponentially, whereas the number of neighbor- 
hoods generated increases by a linear factor. The ILP table 
precomputation requires solving many ILP formulations. 
The table then makes qPMS7 less memory efficient than 
PMS8. The peak memory used by qPMS7 for the challeng- 
ing instances in Table 1 was 607 MB whereas for PMS8 
it was 122 MB. Furthermore, due to the size of the ILP 
table, qPMS7 is not able to solve any instances where 
/ > 23. PMS8 is the first algorithm to solve the challenging 
instances (25,10) and (26,11). 

Some recent results in the literature have also focused 
on instances other than the challenging ones presented 
above. A summary of these results and a comparison 
with PMS8 is presented in Table 2, starting with the 
most recent results. These results have been obtained on 



Table 2 Comparison between PMS8 and recent results in the literature 


Previous algorithm 


Instance 


Time 


Cores 


PMSS time 


PMSS cores 


Abbas et al. 201 2 [1 2], PHEP_PMSprune 


(21,8) 


20.42h 


8 


6.5m 




Yu etal.2012[3],PairMotif 


(27, 9) 


lOh 


1 


4s 




Desaraju and Mukkamala 201 1 [7] 


(24,6) 


347s 


1 


Is 




(48,12) 


188s 


1 


Is 




Dasari et al. 201 1 [1 3], mSPELLER / gSPELLER 


(21,8) 


3.7h 


16 


6.5m 


16 


(21,8) 


2.2h 


4GPUsx240 cores 


6.5m 


16 


Dasari et al. 201 0 [14], BitBased 


(21,8) 


l.lh 


6.5m 


16 


Dasari and Desh 201 0 [1 5], BitBased 


(21,8) 


6.9h 


16 


6.5m 


16 


Sahooetal.2011 [16] 


(16,4) 


106s 


4 


Is 




Sun etal. 2011 [1 7],TreeMotif 


(40,14) 


6h 


1 


6s 




Heetal.2010[18],ListMotif 


(40,14) 


28,087s 


1 


6s 




Faheem 201 0 [1 9], skip-Brute Force 


(15,4) 


2934s 


96 nodes 


Is 






(24,8) 


4h 


1 


5s 




Ho etal. 2009 [6],iTriplet 


(38,12) 


Ih 


1 


Is 






(40,12) 


5m 


1 


Is 





Side by side comparison between PMSS and recent results in the literature. Time is reported in seconds (s), minutes (m) or hours (h). Note that the hardware is 
different, though we tried to match the number of processors when possible. Also, the instances are randomly generated using the same algorithm, however the 
actual instances used by the various papers are most likely different. For PMSS, the times are averages over 5 randomly generated instances. 




(13,4) (15,5) (17,6) (19,7) (21,8) (23,9) 



Instance 

Figure 6 Speedup of PMSS single core over qPIVlS7. Ratio of 
runtimes between qPMS7 and PMS8 running on a single core. Both 
programs have been executed on the same hardware and the same 
datasets. The times are average runtimes over 5 instances for each 
dataset. 



In Figure 4 we show the effect of the first speedup tech- 
nique (sorthing rows by size) on the runtime. Note that 
all other speedups are enabled, only sorting rows by size 
is varied. The figure shows that sorting the rows by size 
improves the speed of PMSS by 25% to 50%. 

The runtime of PMSS on instances with / up to 50 and 
<i up to 21 is shown in Figure 5. Instances which are 
expected to have more than 500 motifs simply by ran- 
dom chance (spurious motifs) are excluded. The expected 
number of spurious motifs was computed as described 
in Appendix 3. Instances where d is small relative to / 
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Table 3 Runtime comparison between PMS8 and qPMS? on real datasets f rom [20] 



Dataset 


n 


Total no. bases 


1 


d 


PMS8 time 


qPMS7time 


dmOIr 


4 


6000 


21 


4 


1 


55 


dmOIr 


4 


6000 


23 


5 


1 


6 


dm04r 


4 


8000 


21 


4 


1 


5 


dnn04r 


4 


8000 


23 


5 


1 


5 


hmOIr 


18 


36000 


21 


6 


10 


14 


hmOIr 


18 


36000 


23 


7 


25 


40 


hm02r 


9 


9000 


21 


6 


1 


11 


hm02r 


9 


9000 


23 


7 


4 


35 


hmOSr 


10 


15000 


21 


6 


3 


24 


hmOBr 


10 


15000 


23 


7 


14 


146 


hm04r 


13 


26000 


21 


6 


6 


44 


hm04r 


13 


26000 


23 


7 


15 


39 


hmOSr 


3 


3000 


21 


4 


1 


6 


hm05r 


3 


3000 


23 


5 


1 


46 


hm08r 


15 


7500 


17 


5 


1 


7 


hmOSr 


15 


7500 


17 


6 


46 


251 


hnnl9r 


5 


2500 


23 


5 


1 


5 


hnn19r 


5 


2500 


23 


6 


1 


5 


hm20r 


35 


70000 


21 


6 


27 


32 


hm20r 


35 


70000 


23 


7 


56 


136 


hnn26r 


9 


9000 


23 


6 


1 


5 


hnn26r 


9 


9000 


23 


7 


5 


46 


mus02r 


9 


9000 


21 


6 


1 


11 


mus02r 


9 


9000 


23 


7 


2 


45 


mus04r 


7 


7000 


21 


6 


1 


15 


nnus04r 


7 


7000 


23 


7 


2 


22 


musOSr 


4 


2000 


21 


5 


1 


79 


musOSr 


4 


2000 


23 


6 


1 


5 


nnus07r 


4 


6000 


21 


5 


1 


79 


nnus07r 


4 


6000 


23 


5 


1 


6 


muslOr 


13 


13000 


21 


6 


2 


56 


muslOr 


13 


13000 


23 


7 


2 


70 


musi 1 r 


12 


6000 


21 


7 


8 


150 


musi 1 r 


12 


6000 


23 


8 


23 


938 


ystOIr 


9 


9000 


21 


6 


2 


14 


ystOIr 


9 


9000 


23 


7 


8 


63 


yst02r 


4 


2000 


21 


5 


1 


5 


yst02r 


4 


2000 


23 


6 




6 


ystOSr 


8 


4000 


21 


6 




8 


yst03r 


8 


4000 


23 


7 




19 


yst04r 


6 


6000 


21 


4 




5 


yst04r 


6 


6000 


23 


5 




5 


ystOSr 


3 


1500 


21 


4 




5 


ystOSr 


3 


1500 


23 


5 




5 
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Table 3 Runtime comparison between PMS8 and qPMS? on real datasets f rom [20] (Continued) 



yst06r 


7 


3500 


21 


6 


1 


6 


yst06r 


7 


3500 


23 


7 


2 


12 


ystOSr 


11 


11000 


21 


5 


1 


6 


ystOSr 


11 


11000 


23 


6 


1 


6 


yst09r 


16 


16000 


21 


6 


2 


17 


yst09r 


16 


16000 


23 


7 


6 


68 



For each dataset we tested two combinations of / and d. For qPMS? we set q = n. Both algorithms were executed on a single CPU core. Time is reported in seconds, 
rounded up to the next second. 



various types of hardware: single core, multi-core, GPU, 
grid. In the comparison, we try to match the number 
of processors whenever possible. The speed difference 
is of several orders of magnitude in some cases which 
indicates that the pruning conditions employed by PMS8 
exponentially reduce the search space compared to other 
algorithms. 

We also compared PMS8 with qPMS7 on the real 
datasets discussed in [20] . We excluded datasets with less 
than 4 input sequences because these are not very chal- 
lenging. For each dataset we chose two combinations of 
/ and d. These combinations were chosen on a dataset 
basis because for large values of d the number of reported 
motifs is excessive and for small values of d the instance is 
not very challenging. To make qPMS7 behave like PMS8 
we set the quorum percent to 100% {q = n). In Table 3 we 
report the dataset name, the total number of sequences, 
the total number of bases in each dataset, the / and d com- 
bination and the runtimes of the two algorithms. Note that 
both algorithms are exact algorithms and therefore the 
sensitivity and specificity are the same. Similar to the com- 
parison on synthetic data, the comparison on real data 
reveals that PMS8 outperforms qPMS7. 

Conclusions 

We have presented PMS8, an efficient algorithm for the 
PMS problem. PMS8 is able to efficiently generate neigh- 
borhoods for t /-mers at a time, by using the pruning 
conditions presented in this paper. Previous algorithms 
generate neighborhoods for only up to three /-mers at 
a time whereas in PMS8 the value of t is increased as 
the instances become more challenging and therefore the 
exponential explosion is postponed. The second reason 
for the efficiency of PMS8 comes from the careful imple- 
mentation which employs several speedup techniques and 
emphasizes cache locality. 

Appendix 

Appendix 1 Generating neighborhoods 

Algorithm !♦ GenerateNeighborhood(r, d) 
for (/ = l..|r|) do Vi := d; 
GenerateNeighborhood(r, r, 1) 



GenerateNeighborhood(r, ryf) 
if ip < I) then 

if (not prune(r, r)) then 
for Of G E do 

for (/= l..|r|)do 

r; := Ti[2..\Si\] 

r[ := n; 

if{Ti[0]^a) then := /.-I; 
end for 

GenerateNeighborhood(r^ + 1) 
end for 
end if 
else 

report /-mer x 
end if 

Appendix 2 PMS8 pseudocode 
Algorithm 2. PMS8(r, d) 

for (/ = do Ri = {u\u e Si} 
stack = {} 

GenerateMotifs(l, stacks R) 

GenerateMotifs( stack, R) 
for (u e Rp) do 
stack.push(w) 
R' :=mter{R, stack) 
if (7?'.size> 0) then 

if (ThresholdCondition) then 

N :=GenerateNeighborhood(stack,<i) 
for {m e N) do 

if (isMotif(m, R')) then output m; 

else 

GenerateMotifs(/7 + 1,7?) 
stack.popO 
end for 



Appendix 3 Challenging instances 

For a fixed /, as d increases, the instance becomes more 
challenging. However, as d increases, the number of false 
positives also increases, because many motifs will appear 
simply by random chance. The expected number of 
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spurious motifs in a random instance can be estimated as 
follows (see e.g., [4]). The number of /-mers in the neigh- 
borhood ofa given /-mer Mis A/^(E,/,^) = E;^o(^)(|S| - 
1)^. The probability that M is a <i-neighbor of a random 
l-mer is p(T^J,d) = A/^(E, /, |^. The probability that 
M has at least one <i-neighbor among the /-mers of a string 
oflength mis thus (^(m,E,/,^) = l-(l-p{^J,d))'^-^^\ 
The probability that M has at least one <i-neighbor in each 
of n random strings of length m is q(m, S, /, <i)^. Finally, 
the expected number of spurious motifs in an instance 
with n strings of length m each is: |E E, /, 

In this paper we consider all combinations of / and d 
where / is at most 50 and the number of spurious motifs 
(expected by random chance) does not exceed 500. Note 
that for a fixed d, if we can solve instance (l,d) we 
can also solve all instances {l^,d) where f > /, because 
they are less challenging than (/, d). 

Appendix 4 Heuristics for t and n 

In the methods section we mentioned that we heuristi- 
cally estimate the threshold t at which we switch from the 
pattern driven to the sample driven part. The exact for- 
mula used by the algorithm to compute t was t = max 
(2, Ly2(^+l)logE -logmj). This follows the intuition 
that t should increase with E to avoid large neighbor- 
hoods and decrease with m to avoid spending too much 
time on filtering. 

In the Speedup techniques section we mentioned a 
speedup where we compute motifs for a subset oin' < n 
strings. By default, the algorithm heuristically computes 

as = min{n, t -\- n/^ — logt). These simple heuristics 
worked well enough on all our test cases, however the user 
can easily override them. 
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